ci: add DBR LTS install check to catch ES-1960554-class regressions #843
Merged
gopalldb
approved these changes
Jul 2, 2026
…554)

The thrift 0.23.0 bump (PR #796, shipped in 4.2.7) broke `pip install` on DBR LTS: thrift ships sdist-only and 0.23.0's setup.py calls sys.exit(0) on the build-success path, killing the PEP 517 backend before pip writes output.json. On the old setuptools shipped by DBR 14.3/15.4 LTS this is a hard install failure (SEV0 ES-1960554); 4.2.7 was yanked and reverted (#840).

Our CI never caught it because every job installs via `poetry install` on a modern runner -- it never does a fresh `pip install` of the built wheel on an LTS toolchain, the real customer path that failed.

CI check
--------
Adds a PR check (gated to dependency changes) that builds the wheel and installs it INSIDE real DBR LTS clusters via the PECO workspace Jobs API (no PyPI publish), then runs a SELECT 1 smoke test. Matrix = supported LTS {13.3, 14.3, 15.4, 16.4, 17.3} x install target {base, pyarrow, kernel}.

Auth is OAuth M2M as the PECO service principal throughout (driver -> workspace API and the notebook's connector -> warehouse smoke query); a PAT is warehouse-scoped and rejected by the workspace REST API. Older LTS ship an SDK too old for auth_type=oauth-m2m, so the smoke harness upgrades databricks-sdk. Per-run artifacts are cleaned up in a finally block.

Connector fix (caught by the check)
-----------------------------------
The check surfaced a real latent bug: a base install (no [pyarrow] extra) runs against a runtime's bundled pyarrow, and on DBR 13.3/14.3 that pyarrow predates the `promote_options` kwarg, so concat_table_chunks raised `TypeError: concat_tables() got an unexpected keyword argument 'promote_options'` on the Arrow result path. utils.py now falls back to the legacy `promote=True` (equivalent to promote_options="default") when the kwarg is unsupported, with a regression test.
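The fallback described in the connector fix can be sketched roughly as follows. This is a minimal illustration, not the code in utils.py: the helper name `concat_with_promote_fallback` and the stand-in `legacy_concat_tables` are hypothetical, and the concat function is injected so the logic can be exercised without pyarrow installed.

```python
from typing import Any, Callable, Sequence


def concat_with_promote_fallback(
    concat_tables: Callable[..., Any], tables: Sequence[Any]
) -> Any:
    """Concatenate tables, tolerating a pyarrow that lacks `promote_options`.

    `concat_tables` would normally be `pyarrow.concat_tables`; it is a
    parameter here (an assumption of this sketch) to keep the example
    independent of the installed pyarrow version.
    """
    try:
        # Newer pyarrow selects schema promotion via `promote_options`.
        return concat_tables(tables, promote_options="default")
    except TypeError as exc:
        # Only handle the "unexpected keyword argument" case; any other
        # TypeError is a genuine error and is re-raised unchanged.
        if "promote_options" not in str(exc):
            raise
        # Older pyarrow only understands the legacy boolean flag.
        return concat_tables(tables, promote=True)


# Hypothetical stand-in for an old pyarrow's concat_tables signature.
def legacy_concat_tables(tables, promote=False):
    return {"tables": list(tables), "promote": promote}


print(concat_with_promote_fallback(legacy_concat_tables, ["t1", "t2"]))
# {'tables': ['t1', 't2'], 'promote': True}
```

In real use the call would presumably look like `concat_with_promote_fallback(pyarrow.concat_tables, chunks)`. Matching on the keyword name in the exception message keeps unrelated `TypeError`s from being silently retried.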
Validated end-to-end against the PECO workspace: green on thrift 0.22.0, and re-widening the pin to <0.24.0 fails on 14.3+15.4 with the exact output.json error -- a true guard, not a check that always passes. Also adds an incident-linked comment on the thrift pin so nobody re-widens it before the upstream fix (THRIFT-6067 / apache/thrift#3584) ships.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
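The failure mode named at the top of this commit message (a build step that calls `sys.exit(0)` before the result file is written) can be modelled with a toy example. This is a simplified model and not pip's, setuptools', or thrift's actual code; every function name below is hypothetical.

```python
import json
import os
import sys
import tempfile


def run_hook_and_record(hook, control_dir):
    """Toy stand-in for a build helper: call the hook, then record its result."""
    return_val = hook()  # a SystemExit raised here skips the write below
    with open(os.path.join(control_dir, "output.json"), "w") as fh:
        json.dump({"return_val": return_val}, fh)


def well_behaved_build():
    # Returns normally, so the result gets recorded.
    return "example-1.0-py3-none-any.whl"  # hypothetical wheel name


def exiting_build():
    # Mirrors the reported behaviour: exit "successfully" instead of returning.
    sys.exit(0)


def output_written(hook):
    """Return True if output.json exists after running `hook`."""
    with tempfile.TemporaryDirectory() as control_dir:
        try:
            run_hook_and_record(hook, control_dir)
        except SystemExit:
            # In a real helper subprocess this would end the process with
            # status 0 before anything was written.
            pass
        return os.path.exists(os.path.join(control_dir, "output.json"))


print(output_written(well_behaved_build))  # True
print(output_written(exiting_build))       # False
```

The point of the sketch is that an exit status of 0 does not imply the result file exists, which is consistent with the `output.json` error reported above.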
…BRICKS_USER

The e2e suite connected via a PAT (DATABRICKS_TOKEN). The Personal Staging Location tests PUT/GET/REMOVE against stage://tmp/<DATABRICKS_USER>/..., where DATABRICKS_USER is the PECO service principal (TEST_PECO_SP_ID). A personal stage is identity-scoped by design (there is even a test asserting you cannot touch another user's stage), so the connecting identity MUST equal DATABRICKS_USER. When DATABRICKS_TOKEN authenticates as a different identity, those tests fail with `PERMISSION_DENIED: <user> does not have access to Personal Stage`.

Switch the e2e connection to OAuth M2M as the service principal via credentials_provider (conftest.auth_connect_kwargs), so the connecting identity IS the SP == DATABRICKS_USER. Falls back to the PAT when SP OAuth creds aren't set, so local PAT runs are unaffected.

Wires DATABRICKS_CLIENT_ID / DATABRICKS_CLIENT_SECRET (TEST_PECO_SP_ID / TEST_PECO_SP_OAUTH_SECRET, already in azure-prod) into code-coverage.yml.

Verified locally against the PECO workspace: all 9 staging_ingestion e2e tests pass via the real M2M path (including fails_to_modify_another_staging_user, which validates the identity scoping). Kernel e2e files are unchanged (they run in kernel-e2e.yml, ignored by code-coverage.yml).

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
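A rough sketch of the M2M-with-PAT-fallback selection this commit message describes might look like the following. The signature of `auth_connect_kwargs` and the `host` and `env` parameters are assumptions, not the conftest code. The `Config` / `oauth_service_principal` pairing follows the documented databricks-sdk pattern for service-principal OAuth.

```python
import os
from typing import Any, Dict, Mapping


def auth_connect_kwargs(
    host: str, env: Mapping[str, str] = os.environ
) -> Dict[str, Any]:
    """Pick connection auth kwargs: OAuth M2M if SP creds are set, else PAT.

    `host` is the workspace hostname without a scheme (an assumption of
    this sketch).
    """
    client_id = env.get("DATABRICKS_CLIENT_ID")
    client_secret = env.get("DATABRICKS_CLIENT_SECRET")

    if client_id and client_secret:
        def credentials_provider():
            # Imported lazily so PAT-only environments need no databricks-sdk.
            from databricks.sdk.core import Config, oauth_service_principal

            config = Config(
                host=f"https://{host}",
                client_id=client_id,
                client_secret=client_secret,
            )
            return oauth_service_principal(config)

        return {"credentials_provider": credentials_provider}

    # Fallback: plain personal access token, as in local runs.
    return {"access_token": env["DATABRICKS_TOKEN"]}
```

The result would then be splatted into the connector call, for example `sql.connect(server_hostname=host, http_path=http_path, **auth_connect_kwargs(host))`. Deferring the SDK import to the provider keeps the PAT path free of the extra dependency.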
The e2e M2M auth (conftest.auth_connect_kwargs) imports databricks.sdk.core.oauth_service_principal, but databricks-sdk was not a project dependency, so `poetry install` in code-coverage.yml didn't provide it -- every e2e connection failed with `ModuleNotFoundError: No module named 'databricks.sdk'`. Add it to the dev group (test-only; not a runtime dep of the connector). CI's setup-poetry runs `poetry lock` before install, so the lockfile is regenerated on the runner.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
The azure-prod DATABRICKS_TOKEN is now a personal access token owned by the PECO service principal, so its identity matches DATABRICKS_USER. That fixes the Personal Staging Location tests (stage://tmp/<SP>/...) with a plain PAT, without the OAuth M2M machinery -- which also broke the retry/HTTP tests, since M2M makes a live token-endpoint call that those tests' urllib3 mocking intercepts.

Reverts the e2e auth changes (conftest.auth_connect_kwargs + the consumer call sites + the databricks-sdk dev dep + the code-coverage.yml SP env) back to the plain access_token path.

The DBR LTS install check keeps OAuth M2M: it hits the workspace Jobs/SCIM API (which rejects a warehouse-scoped PAT), is proven 15/15 green on M2M, and installs databricks-sdk itself in its own workflow.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
What
Adds a DBR LTS Install CI check that builds the connector wheel and installs it inside real DBR LTS clusters (via the PECO workspace Jobs API, so no PyPI publish is needed), then runs a `SELECT 1` smoke test. Matrix = supported LTS {13.3, 14.3, 15.4, 16.4, 17.3} × install target {base, pyarrow, kernel}.

Also adds an incident-linked comment on the `thrift` pin in `pyproject.toml` so nobody re-widens it before the upstream fix ships.

Why
The thrift 0.23.0 bump (PR #796, shipped in 4.2.7) broke `pip install` on DBR LTS (SEV0 ES-1960554). thrift ships sdist-only, and 0.23.0's `setup.py` calls `sys.exit(0)` on the build-success path, killing the PEP 517 backend before pip writes `output.json`. On the old setuptools shipped by DBR 14.3/15.4 LTS this is a hard install failure. 4.2.7 was yanked and the bump reverted (PR #840).

Our CI never caught it because every job installs via `poetry install` on a modern runner: it never does a fresh `pip install` of the built wheel on an LTS toolchain, which is the real customer path that failed. This PR closes exactly that gap.

How it works
Per matrix leg, `scripts/dbr_lts_install_check.py` (driver, runs on the GH runner):
- […] `scripts/dbr_lts_smoke_notebook.py` into the workspace,
- […] `spark_version`; the notebook `pip install`s the wheel (+ extras) and runs `SELECT 1`,
- […] `finally` (every exit path).

Several non-obvious DBR-cluster constraints are baked in and commented (notebook_task not spark_python_task-from-Volume; SINGLE_USER access mode for UC/Volume access; `dbutils.fs.cp` the wheel off `/Volumes`; `dbutils.library.restartPython()` after install).

Gating
Runs on `pull_request`, but the cluster matrix runs only when dependency-affecting files change (`pyproject.toml` / `poetry.lock` / this workflow / the two scripts), the only surface that can introduce this failure class. Informational check (not a required merge-queue gate). Uses only secrets already present in the `azure-prod` environment (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `TEST_PECO_WAREHOUSE_HTTP_PATH`); the notebook-import dir is derived from the token's own identity.

Validation
Exercised end-to-end against the PECO workspace:
- […] `SELECT 1` pass on 15.4 for both `base` and `[pyarrow]`.
- […] `<0.24.0` fails on 14.3 and 15.4 with the exact incident error (`Downloading thrift-0.23.0.tar.gz` → `OSError ... output.json`), so the check is a true guard, not one that always passes.

Follow-ups (external / not in this PR)
- […] `azure-prod` `DATABRICKS_TOKEN` can create job clusters and write to the `peco.default.ci_wheels` Volume.

This pull request and its description were written by Isaac.
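To make the gating rule and the matrix in this description concrete, the decision "which legs run for a given set of changed files" could be sketched as below. The workflow filename is hypothetical (the description only says "this workflow"), and the actual check presumably expresses this in workflow configuration rather than Python.

```python
from itertools import product

# Matrix axes as listed in the description.
LTS_VERSIONS = ["13.3", "14.3", "15.4", "16.4", "17.3"]
INSTALL_TARGETS = ["base", "pyarrow", "kernel"]

# Dependency-affecting paths. The workflow path is a hypothetical placeholder;
# the other entries are named in the description.
DEPENDENCY_PATHS = {
    "pyproject.toml",
    "poetry.lock",
    ".github/workflows/dbr-lts-install.yml",  # hypothetical filename
    "scripts/dbr_lts_install_check.py",
    "scripts/dbr_lts_smoke_notebook.py",
}


def matrix_legs(changed_files):
    """Return the (LTS version, install target) legs to run for a change set."""
    if not DEPENDENCY_PATHS.intersection(changed_files):
        # Nothing dependency-affecting changed: skip the cluster matrix.
        return []
    return list(product(LTS_VERSIONS, INSTALL_TARGETS))


print(len(matrix_legs(["poetry.lock", "README.md"])))  # 15
print(matrix_legs(["src/databricks/sql/utils.py"]))    # []
```

Five LTS versions times three install targets gives the 15 legs behind the "15/15 green" figure quoted in the commit history.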